Pipelined completion for asynchronous communication

ABSTRACT

An asynchronous circuit having a pipelined completion mechanism to achieve improved throughput.

[0001] This application is a continuation-in-part of U.S. application Ser. No. 09/118,140, filed on Jul. 16, 1998 and claims the benefit of U.S. provisional application No. 60/058,662, filed on Sep. 12, 1997. The disclosure of the above two applications is incorporated herein by reference.

FIELD OF THE INVENTION

[0002] The present invention relates to information processing, and more specifically to architecture and operation of asynchronous circuits and processors.

BACKGROUND

[0003] Many information processing devices operate based on a control clock signal to synchronize operations of different processing components and therefore are usually referred to as “synchronous” processing devices. In general, different processing components may operate at different speeds due to various factors including the nature of different functions and different characteristics of the components or properties of the signals processed by the components. Synchronization of these different processing components requires the speed of the control clock signal to accommodate the slowest processing component. Thus, some processing components may complete respective operations earlier than other slow components and have to wait until all processing components complete their operations. Although the speed of a synchronous processor can be improved by increasing the clock speed to a certain extent, synchronous processing is not an efficient way of utilizing available resources.

[0004] An alternative approach, pioneered by Alain Martin of California Institute of Technology, eliminates synchronization of different processing components according to a clock signal. Different processing components simply operate as fast as permitted by their structures and operating environments. There is no relationship between a clock speed and the operation speed. This obviates many technical obstacles in a synchronous processor and can be used to construct an “asynchronous” processor with a much simplified architecture and a fast processing speed that are difficult to achieve with synchronous processors.

[0005] U.S. Pat. No. 5,752,070 to Martin and Burns discloses such an asynchronous processor, which is incorporated herein by reference in its entirety. This asynchronous processor goes against the conventional wisdom of using a clock to synchronize various components and operations of the processor and operates without a synchronizing clock. The instructions can be executed as fast as the processing circuits allow and the processing speed is essentially limited only by delays caused by gates and interconnections.

[0006] Such an asynchronous processor can be optimized for high-speed processing by special pipelining techniques based on unique properties of the asynchronous architecture. Asynchronous pipelining allows multiple instructions to be executed at the same time. This has the effect of executing instructions in a different order than originally intended. An asynchronous processor compensates for this out-of-order execution by maintaining the integrity of the output data without a synchronizing clock signal.

[0007] A synchronous processor relies on the control clock signal to indicate when an operation of a component is completed and when the next operation of another component may start. Because an asynchronous processor eliminates such clock-based synchronization, a pipelined processing component in an asynchronous processor instead generates a completion signal to inform the previous processing component of the completion of an operation.

[0008] For example, assume P1 and P2 are two adjacent processing components in an asynchronous pipeline. The component P1 receives and processes data X to produce an output Y. The component P2 processes the output Y to produce a result Z. At least two communication channels are formed between P1 and P2: a data channel that sends Y from P1 to P2 and a request/acknowledgment channel by which P2 acknowledges receipt of Y to P1 and requests the next Y from P1. The messages communicated to P1 via the request/acknowledgment channel are produced by P2 according to a completion signal internal to P2.

[0009] Generation of this completion signal can introduce an extra delay that degrades the performance of the asynchronous processor. Such extra delay is particularly problematic when operations on a datum are decomposed into two or more concurrent elementary operations on different portions of the datum. Each elementary operation requires a completion signal. The completion signals for all elementary operations are combined into one global completion signal that indicates completion of operations on that datum. Hence, a completion circuit (“completion tree”) is needed to collect all elementary completion signals to generate that global completion signal. The complexity of such a completion tree increases with the number of the elementary completion signals.

[0010] When not properly implemented, the extra delays of a completion tree can significantly offset the advantages of an asynchronous processor. Therefore, it is desirable to reduce or minimize the delays in a completion tree.

SUMMARY

[0011] The present disclosure provides a pipelined completion tree for asynchronous processors. A high throughput and a low latency can be achieved by decomposing any pipeline unit into an array of simple pipeline blocks. Each block operates only on a small portion of the datapath. Global synchronization between stages, when needed, is implemented by copy trees and slack matching.

[0012] More specifically, one way to reduce the delay in the completion tree uses asynchronous pipelining to decompose a long critical cycle in a datapath into two or more short cycles. One or more decoupling buffers may be disposed in the datapath between two pipelined stages. Another way to reduce the delay in the completion tree is to reduce the delay caused by distribution of a signal to all N bits in an N-bit datapath. Such delay can be significant when N is large. The N-bit datapath can also be partitioned into m small datapaths of n bits (N=m×n) that are parallel to one another. These m small datapaths can transmit data simultaneously. Accordingly, each N-bit processing stage can also be replaced by m small processing blocks of n bits.

[0013] One embodiment of the asynchronous circuit uses the above two techniques to form a pipelined completion tree in each stage to process data without a clock signal. This circuit comprises a first processing stage receiving an input data and producing a first output data, and a second processing stage, connected to communicate with said first processing stage without prior knowledge of delays associated with said first and second processing stages and to receive said first output data to produce an output. Each processing stage includes:

[0014] a first register and a second register connected in parallel relative to each other to respectively receive a first portion and a second portion of a received data,

[0015] a first logic circuit connected to said first register to produce a first completion signal indicating whether all bits of said first portion of said received data are received by said first register,

[0016] a second logic circuit connected to said second register to produce a second completion signal indicating whether all bits of said second portion of said received data are received by said second register,

[0017] a third logic circuit connected to receive said first and second completion signals and configured to produce a third completion signal to indicate whether all bits of said first and second portions of said received data are received by said first and second registers,

[0018] a first buffer circuit connected between said first logic circuit and the third logic circuit to pipeline said first and third logic circuits, and

[0019] a second buffer circuit connected between said second logic circuit and the third logic circuit to pipeline said second and third logic circuits.

[0020] These and other aspects and advantages will become more apparent in light of the following accompanying drawings, the detailed description, and the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0021] FIG. 1 shows two communicating processing stages in an asynchronous pipeline circuit based on a quasi-delay-insensitive four-phase handshake protocol.

[0022] FIG. 2 shows a prior-art completion tree formed by two-input C-elements.

[0023] FIG. 3A is a simplified diagram showing the asynchronous pipeline in FIG. 1.

[0024] FIG. 3B shows an improved asynchronous pipeline with a decoupling buffer connected between two processing stages.

[0025] FIG. 3C shows one implementation of the circuit of FIG. 3B using a C-element as the decoupling buffer.

[0026] FIG. 4 shows an asynchronous circuit implementing a pipelined completion tree and a pipelined distribution circuit in each processing stage.

[0027] FIG. 5 shows a copy tree circuit.

[0028] FIG. 6 shows one embodiment of the copy tree in FIG. 5.

[0029] FIG. 7A is a diagram illustrating decomposition of an N-bit datapath of an asynchronous pipeline into two or more parallel datapaths with each having a processing block to process a portion of the N-bit data.

[0030] FIG. 7B is a diagram showing different datapath structures at different stages in an asynchronous pipeline.

[0031] FIG. 7C shows a modified circuit of the asynchronous pipeline in FIG. 7A where a processing stage is decomposed into two pipelined small processing stages to improve the throughput.

[0032] FIG. 8 shows an asynchronous circuit having a control circuit to synchronize decomposed processing blocks of two different processing stages.

[0033] FIG. 9A shows a balanced binary tree.

[0034] FIG. 9B shows a skewed binary tree.

[0035] FIG. 9C shows a 4-leaf skewed completion tree.

[0036] FIG. 9D shows a 4-leaf balanced completion tree.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0037] The asynchronous circuits disclosed herein are quasi-delay-insensitive in the sense that such circuits do not use any assumption on, or knowledge of, delays in most operators and wires. One of various implementations of such quasi-delay-insensitive communication is a four-phase protocol for communication between two adjacent processing stages in an asynchronous pipeline. This four-phase protocol will be used in the following to illustrate various embodiments and should not be construed as limitations of the invention.

[0038] FIG. 1 is a block diagram showing the implementation of the four-phase protocol in an asynchronous pipeline. Two adjacent stages (or processing components) 110 (“A”) and 120 (“B”) are connected to send an N-bit data from the first stage 110 to the second stage 120 via data channels 130. A communication channel 140 is implemented to send a request/acknowledgment signal “ra” by the second stage 120 to the first stage 110. The signal ra either requests data to be sent or acknowledges reception of data to the first stage 110. The processing stages 110 and 120 are not clocked or synchronized to a control clock signal.

[0039] The first stage 110 includes a register part “R_(A)”, 112, and a control part “C_(A)”, 114. The register part 112 stores data to be sent to the second stage 120. The control part 114 generates an internal control parameter “x” 116 to the data channels 130, e.g., triggering sending data or resetting the data channels. The control part 114 also controls data processing in the first stage 110 which generates the data to be sent to the second stage 120. The second stage 120 includes a register part 122 that stores received data from register part 112, a control part “C_(B)”, 124, that generates the request/acknowledgment signal ra over the channel 140 and controls data processing in the second stage, and a completion tree 126 that connects the register part 122 and the control part 124.

[0040] The completion tree 126 is a circuit that checks the status of the register part 122 and determines whether the processing of the second stage 120 on the received data from the first stage 110 is completed. An internal control parameter “y” 128 is generated by the completion tree 126 to control the operation of the control part 124.

[0041] One possible four-phase handshake protocol is as follows. When the completion tree 126 detects that the second stage 120 has completed processing of the received data and is ready to receive the next data from the first stage 110, a request signal is generated by the control part 124 in response to a value of the control parameter y (128) and is sent to the control part 114 via the channel 140 to inform the first stage 110 that the stage 120 is ready to receive the next data. This is the “request” phase.

[0042] Next, in a data transmission phase, the first stage 110 responds to the request by sending out the next data to the second stage 120 via the data channels 130. More specifically, the control part 114 processes the request from the control part 124 and instructs the register part 112 by using the control parameter x (116) to send the next data.

[0043] An acknowledgment phase follows. Upon completion of receiving the data from the first stage 110, the completion tree 126 changes the value of the control parameter y (128) so that the control part 124 produces an acknowledgment signal via the channel 140 to inform the first stage 110 (i.e., the control part 114) of completion of the data transmission.

[0044] Finally, the control part 114 changes the value of the control parameter x (116) which instructs the register part 112 to stop data transmission. This action resets the data channels 130 to a “neutral” state so that the next data can be transmitted when desired. In addition, the completion tree 126 resets the value of the control parameter y to the control part 124 to produce another request. This completes an operation cycle of request, data transmission, acknowledgment, and reset.
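The four phases described above can be sketched as a small set of guarded transitions. The following Python model is illustrative only: the guard conditions are one assumed reading of the request, transmission, acknowledgment, and reset phases, the signal D is reduced to a single flag (1 for valid, 0 for neutral), and the names RULES and run_cycle are hypothetical.

```python
# Each rule is (name, guard, signal, new_value). A rule may fire only when its
# guard holds, which models one stage waiting on the other without any clock.
RULES = [
    ("ra+", lambda s: s["y"] == 0 and s["ra"] == 0, "ra", 1),  # request
    ("x+",  lambda s: s["ra"] == 1 and s["x"] == 0, "x", 1),   # sender told to send
    ("D+",  lambda s: s["x"] == 1 and s["D"] == 0, "D", 1),    # data rails become valid
    ("y+",  lambda s: s["D"] == 1 and s["y"] == 0, "y", 1),    # completion detected
    ("ra-", lambda s: s["y"] == 1 and s["ra"] == 1, "ra", 0),  # acknowledgment
    ("x-",  lambda s: s["ra"] == 0 and s["x"] == 1, "x", 0),   # sender told to stop
    ("D-",  lambda s: s["x"] == 0 and s["D"] == 1, "D", 0),    # data rails return to neutral
    ("y-",  lambda s: s["D"] == 0 and s["y"] == 1, "y", 0),    # completion signal reset
]


def run_cycle(steps=8):
    """Fire enabled rules one at a time, starting with all wires low."""
    state = {"ra": 0, "x": 0, "D": 0, "y": 0}
    trace = []
    for _ in range(steps):
        enabled = [rule for rule in RULES if rule[1](state)]
        assert len(enabled) == 1  # this protocol is strictly sequential
        name, _, signal, value = enabled[0]
        state[signal] = value
        trace.append(name)
    return trace, state
```

Running run_cycle() for eight steps returns every wire to its initial low state, which corresponds to one complete operation cycle.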

[0045] No clock signal is used in the above communication protocol. Each processing component or stage operates as fast as possible to complete a respective processing step and then proceeds to start the next processing step. Such asynchronous pipelined operation can achieve a processing speed, on average, higher than that of a synchronous operation.

[0046] Since the operation is asynchronous, the binary data should be coded with delay-insensitive codes. One simple way of coding data in a delay-insensitive manner is a “dual-rail” code in which each bit is encoded on two wires. Another delay-insensitive code is a 1-of-N code in which one rail is raised for each bit value of the data. See, e.g., U.S. Pat. No. 3,290,511. A delay-insensitive code is characterized by the fact that the data rails alternate between a neutral state that doesn't represent a valid encoding of a data value, and a valid state that represents a valid encoding of a data value. See, Alain J. Martin, “Asynchronous Datapaths and the Design of an Asynchronous Adder” in Formal Methods in System Design, 1:1, Kluwer, 117-137, 1992.
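As an illustration of a dual-rail code, the following Python sketch encodes each bit on a pair of wires and distinguishes the neutral state from a valid state. The rail ordering (false rail, true rail) and all function names are assumptions made for this example and are not taken from the cited references.

```python
NEUTRAL = (0, 0)  # both rails low: no data value is encoded


def encode_bit(bit):
    """Return (false_rail, true_rail) with exactly one rail raised."""
    return (0, 1) if bit else (1, 0)


def is_valid(pair):
    return pair in ((0, 1), (1, 0))


def encode_word(value, width):
    """Dual-rail encode an integer, least significant bit first."""
    return [encode_bit((value >> i) & 1) for i in range(width)]


def word_is_valid(rails):
    """A word is valid only when every bit pair is valid."""
    return all(is_valid(p) for p in rails)


def word_is_neutral(rails):
    return all(p == NEUTRAL for p in rails)


def decode_word(rails):
    if not word_is_valid(rails):
        raise ValueError("word is not in a valid state")
    return sum(true_rail << i for i, (_, true_rail) in enumerate(rails))
```

A word in which some pairs are valid and others are still neutral is neither valid nor neutral, which is the intermediate condition a completion circuit has to wait out.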

[0047] The above four-phase protocol can be broken down into a set phase and a reset phase. The set phase includes the sequence of transitions performed in the request phase and the transmission phase (assuming that all wires are initially set low):

ra↑;x↑;D⇑;y↑.

[0048] Each transition is a process where a signal (e.g., ra, x, D, or y) changes its value. The reset phase includes the sequence of transitions in the acknowledgment phase and the final reset phase:

ra↓;x↓;D⇓;y↓.

[0049] The above operations are expressed in the handshake expansion (“HSE”) notation as defined in the incorporated U.S. Pat. No. 5,752,070. The semicolon indicates two statements to be executed in sequence; the v↑ and v↓ set a boolean variable v to true and false, respectively; C⇑ is the concurrent assignment of some bits of C such that the result is an appropriate valid value without any intermediate value being valid; and C⇓ is the concurrent assignment of some bits of C such that the result is a neutral value without any intermediate value being neutral.

[0050] The false value, y↓, of the completion signal y represents completion of processing and instructs the control part 124 to send out a request. The true value, y↑, represents completion of receiving data and instructs the control part 124 to send out an acknowledgment. The architecture of the completion tree 126 and the generation of the completion signals, y↓ and y↑, are now described in detail.

[0051] Consider an N-bit datum D that is transmitted from the first stage 110 to the second stage 120. The completion signal y↑ is generated when all the bits encoded into D have been written into the register 122 from the register 112. For each bit b_(k) (k=0, 1, . . . , N−1), a write-acknowledgment signal, wack_(k), is generated. When all write-acknowledgment signals are raised, y can be raised to produce the completion signal y↑. Similarly, wack_(k) is lowered when the corresponding bit b_(k) is reset to its neutral value according to a chosen delay-insensitive protocol. Hence, y can be reset to zero when all write-acknowledgment signals are reset to zero (the neutral value). This can be expressed as the following:

wack₀ ∧ wack₁ ∧ . . . ∧ wack_(N−1) → y↑

¬wack₀ ∧ ¬wack₁ ∧ . . . ∧ ¬wack_(N−1) → y↓

[0052] where the notation “¬” represents negation, thus if wack₀ represents a “high”, ¬wack₀ represents a “low”.

[0053] The completion tree 126 is constructed and configured to perform the above logic operations to generate the proper completion signals (i.e., either y↓ or y↑). For any reasonably large value of N, one conventional implementation of the completion tree uses a tree of two-input C-elements as shown in FIG. 2. The two-input C-element (also known as Muller C-element) is a logic gate which outputs a high or low only when both inputs are high or low, respectively, and the output remains unchanged from a previous value if the inputs are different from each other.
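The behavior of a two-input C-element and of a completion tree built from such elements can be sketched behaviorally as follows. This Python model is a simplified illustration, not the circuit of FIG. 2: it ignores gate delays, assumes N is a power of two, and the class names CElement and CompletionTree are hypothetical.

```python
class CElement:
    """Two-input Muller C-element: follows the inputs when they agree, else holds."""

    def __init__(self, out=0):
        self.out = out

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


class CompletionTree:
    """Balanced binary tree of C-elements over n write-acknowledgment signals."""

    def __init__(self, n):
        if n < 2 or n & (n - 1):
            raise ValueError("this sketch assumes n is a power of two, n >= 2")
        self.levels = []
        width = n // 2
        while width >= 1:
            self.levels.append([CElement() for _ in range(width)])
            width //= 2

    def update(self, wack):
        """Propagate the wack signals up the tree and return y."""
        signals = list(wack)
        for level in self.levels:
            signals = [c.update(signals[2 * i], signals[2 * i + 1])
                       for i, c in enumerate(level)]
        return signals[0]
```

Because each element holds its previous output while its inputs disagree, y rises only after every wack signal is high and falls only after every wack signal is low, which matches the two conditions given above.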

[0054] The number of C-elements in FIG. 2 may be reduced by using C-elements of more than two inputs, such as three or even four inputs. However, the existing VLSI technology limits the number of inputs in such C-elements since as the number of p-transistors connected in series to form the C-elements increases, the performance of the C-elements is usually degraded. In general, the number of the inputs of a C-element may be up to 4 with acceptable performance.

[0055] Measurements show that the type of C-element used to construct the completion tree is in general not very important. What is important is that whatever tree is used, the delay through the tree is proportional to log N. The delays through the tree are roughly a constant for C-elements of two inputs, three inputs, or four inputs.
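The logarithmic growth can be illustrated by counting the gate levels a tree needs for a given number of leaves and a given C-element fan-in. The helper below is a hypothetical sketch that counts levels only. It does not model the slower switching of wider C-elements, which is the effect that keeps the measured delay roughly constant across fan-ins.

```python
def tree_depth(n_leaves, fan_in):
    """Number of C-element levels needed to merge n_leaves completion signals."""
    if n_leaves < 1 or fan_in < 2:
        raise ValueError("need n_leaves >= 1 and fan_in >= 2")
    depth, capacity = 0, 1
    while capacity < n_leaves:
        capacity *= fan_in  # each extra level multiplies the reachable leaves
        depth += 1
    return depth
```

For 32 leaves this gives five levels of two-input elements, four levels of three-input elements, or three levels of four-input elements.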

[0056] The two communicating components are said to complete a “cycle” if, after a sequence of transitions, both components return to their respective initial states at the beginning of the sequence. For example, the set phase and the reset phase of transitions in communication between A and B shown in FIG. 1 form a cycle C:

C≡ra↑;x↑;D⇑;y↑;ra↓;x↓;D⇓;y↓.

[0057] The throughput of an asynchronous system is determined by the delay through the longest cycle of transitions. Such a cycle is called a “critical cycle.” Therefore, it is desirable to reduce the critical cycle to improve the throughput.

[0058] For a quasi-delay-insensitive asynchronous system in which any two components communicate according to the above four-phase protocol, a delay δc through the sequence C is a good estimated lower-bound for the critical cycle delay.

[0059] For a normal datapath with N=32 or N=64, the completion tree delays, δ(y↑) and δ(y↓), may be unacceptable if a high throughput is required. For example, in the Caltech MiniMIPS design, the target throughput in the 0.6-μm CMOS technology is around 300 MHz. The critical cycle delay is thus about 3 ns. For a full 32-bit completion tree based on the structure shown in FIG. 2, the completion tree delay is around 1 ns. Hence, one third of the critical cycle delay is caused by the completion tree. This is a significant portion of the critical delay.
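The arithmetic behind these figures can be checked with a short calculation. The function names below are hypothetical, and the inputs are the approximate values quoted above.

```python
def cycle_time_ns(throughput_mhz):
    """Cycle time in nanoseconds for a throughput given in MHz."""
    return 1e3 / throughput_mhz


def completion_fraction(tree_delay_ns, throughput_mhz):
    """Share of the critical cycle spent in the completion tree."""
    return tree_delay_ns / cycle_time_ns(throughput_mhz)
```

With a 300 MHz target the cycle time is about 3.3 ns, so a 1 ns completion tree accounts for roughly 30 percent of the cycle, consistent with the stated one third.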

[0060] The significant contribution to the critical cycle delay from the completion tree is a common drawback of previous asynchronous systems. To a certain extent, such a limitation has prevented many from developing asynchronous systems as an alternative to the dominating synchronous systems in spite of many advantages of asynchronous systems. Hence, it is important to design and configure a completion tree with a significantly reduced delay to make an asynchronous system practical.

[0061] One way to reduce the delay in the completion tree uses asynchronous pipelining to decompose a long critical cycle in a datapath into two or more short cycles. FIGS. 3A and 3B show an example of breaking a long critical cycle between two pipelined stages A and B into two short cycles by pipelining A and B through a buffer.

[0062] FIG. 3A shows two components 310 (A) and 320 (B) that communicate with each other through two simple handshake channels 312 (a) and 322 (b). The protocol may include the following sequence of transitions:

A↑;a↑;B↑;b↑;A↓;a↓;B↓;b↓

[0063] where A↑,B↑,A↓,B↓ represent the transitions inside A and B. If the delay through this cycle is too long to be acceptable (e.g., due to the delays through A and B), a simple buffer 330 can be introduced to form an asynchronous pipelining between A and B as in FIG. 3B to reduce this long cycle into two short cycles.

[0064] The buffer 330 creates two handshake cycles:

Bu1=A↑;a1↑;b1↑;A↓;a1↓;b1↓,

and

Bu2=a2↑;B↑;b2↑;a2↓;B↓;b2↓.

[0065] If the delays of the transitions Bu1↑, Bu1↓ and Bu2↑, Bu2↓ are shorter than the delays of A↑, A↓ and B↑, B↓, the above decomposition reduces the length of the critical cycle.

[0066] The two handshakes are synchronized by the buffer, not by a clock signal. The buffer can be implemented in various ways. FIG. 3C shows one simple implementation that uses a single C-element 340 of two inputs a1, b2 and two outputs a2, b1. The C-element 340 receives the input a1 and an inverted input of b2 to produce two duplicated outputs a2, b1. The two handshakes are synchronized in the following way:

[0067] This particular buffer allows the downgoing phase of A to overlap with the upgoing phase of B and the upgoing phase of A to overlap with the downgoing phase of B. Such overlap reduces the duration of the handshaking process.
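The overlap can be illustrated with a behavioral model of the C-element buffer. In the Python sketch below, the signal c stands for the duplicated outputs a2 and b1, the C-element rules follow the description of FIG. 3C (inputs a1 and inverted b2), and the environment rules for A and B are assumptions chosen to agree with the handshake cycles Bu1 and Bu2. All rules enabled in the same step fire together, and the internal transitions of A and B are omitted.

```python
# (name, guard, signal, new_value); c models both outputs a2 and b1.
RULES = [
    ("a1+", lambda s: s["a1"] == 0 and s["c"] == 0, "a1", 1),                 # A raises a1
    ("c+",  lambda s: s["a1"] == 1 and s["b2"] == 0 and s["c"] == 0, "c", 1),  # C-element sets
    ("a1-", lambda s: s["a1"] == 1 and s["c"] == 1, "a1", 0),                 # A sees b1 high
    ("b2+", lambda s: s["c"] == 1 and s["b2"] == 0, "b2", 1),                 # B sees a2 high
    ("c-",  lambda s: s["a1"] == 0 and s["b2"] == 1 and s["c"] == 1, "c", 0),  # C-element resets
    ("b2-", lambda s: s["c"] == 0 and s["b2"] == 1, "b2", 0),                 # B sees a2 low
]


def simulate(steps):
    """Return, for each step, the sorted names of the transitions that fire together."""
    state = {"a1": 0, "c": 0, "b2": 0}
    trace = []
    for _ in range(steps):
        fired = [(name, sig, val) for name, guard, sig, val in RULES if guard(state)]
        for _, sig, val in fired:
            state[sig] = val
        trace.append(sorted(name for name, _, _ in fired))
    return trace
```

In the resulting trace, the step after c rises contains both a1 falling and b2 rising, and the step after c falls contains both a1 rising and b2 falling, which is the overlap described above.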

[0068] Therefore, when a decoupling buffer is properly implemented, adding stages in an asynchronous pipeline may not necessarily increase the forward latency of the pipeline and may possibly reduce the forward latency.

[0069] The above technique of decomposing a long cycle into two or more pipelined short cycles can reduce the delay along the datapath of a pipeline. However, this does not address another delay caused by distribution of a signal to all N bits in an N-bit datapath, e.g., controlling bits in a 32-bit register that sends out data (e.g., the register 112 in the stage 110). Such delay can also be significant, especially when N is large (e.g., 32 or 64 or even 128). Hence, in addition to adding pipelined stages along a datapath, an N-bit datapath can also be partitioned into m small datapaths of n bits (N=m×n) to further reduce the overall delay. These m small datapaths are connected parallel to one another and can transmit data simultaneously relative to one another. Accordingly, the N-bit register of a stage in the N-bit datapath can also be replaced by m small registers of n bits. The number m and thereby n are determined by the processing tasks of the two communicating stages. A 32-bit datapath, for example, can be decomposed into four 8-bit blocks, or eight 4-bit blocks, or sixteen 2-bit blocks, or even thirty-two 1-bit blocks to achieve a desired performance.
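The partition N=m×n can be sketched as splitting a word into m independent n-bit blocks and reassembling it. The function names and the least-significant-block-first ordering in this Python example are assumptions made for illustration.

```python
def split_datapath(value, total_bits, block_bits):
    """Split an N-bit value into m = N / n blocks of n bits, lowest block first."""
    if total_bits % block_bits:
        raise ValueError("block_bits must divide total_bits")
    mask = (1 << block_bits) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, block_bits)]


def join_datapath(blocks, block_bits):
    """Reassemble the N-bit value from its n-bit blocks."""
    value = 0
    for i, block in enumerate(blocks):
        value |= block << (i * block_bits)
    return value
```

For example, a 32-bit value splits into four 8-bit blocks or eight 4-bit blocks, and each block can then be given its own small register and completion logic.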

[0070] Therefore, decomposition of a long cycle into two or more small cycles can be applied to two directions: one along the pipelined stages by adding decoupling buffers therebetween and another “orthogonal” direction by decomposing a single datapath into two or more small datapaths that are connected in parallel.

[0071] FIG. 4 shows a 32-bit asynchronous pipeline with a pipelined completion tree based on the above two-dimensional decomposition. Four 8-bit registers 401A, 401B, 401C, 401D in the sending stage 110 are connected with respect to one another in parallel. Accordingly, four 8-bit registers 402A, 402B, 402C, 402D in the receiving stage 120 that respectively correspond to the registers in the sending stage 110 are also connected with respect to one another in parallel. This forms four parallel 8-bit datapaths. Each datapath has an 8-input completion tree (e.g., 403A, etc.), and the four completion outputs ctk (k=1, 2, 3, and 4) are combined into one 4-input completion tree 420 that produces a completion signal 421 (ra) for the control 124. This accomplishes one half of the two-dimensional decomposition.

[0072] Decomposition along the datapaths is accomplished by using the decoupling buffer shown in FIGS. 3B and 3C. A completion tree 410 is introduced in the sending stage 110 to receive individual request/acknowledge signals rak (k=1, 2, 3, and 4) directly from individual 8-bit datapaths and thereby to produce a duplicate request/acknowledge signal 411 of the request/acknowledge signal 140 produced by the control part 124. The control part 114 responds to this signal 411 to control the registers 401A, 401B, 401C, and 401D to send the next data.

[0073] At least two decoupling buffers, such as 412A and 422A, are introduced in each datapath with one in the sending stage 110 and another in the receiving stage 120. The buffer 412A, for example, is disposed on wires (ct1, ra1) to interconnect the control part 114, the completion tree 410, register 401A, and the request/acknowledge signal for the first datapath. The buffer 422A is disposed on wires (x1, ra1) to interconnect the first completion tree 403A, the control part 124, the completion tree 420, and the completion tree 410.

[0074] Therefore, the completion trees 403A, 403B, 403C, and 403D are pipelined to the completion tree 420 via buffers 422A, 422B, 422C, and 422D, respectively. Similarly, the completion trees in the stage 110 are also pipelined through buffers 412A, 412B, 412C, and 412D. Such pipelined completion significantly reduces the delay in generating the completion signal for the respective control part. The above decoupling technique can be repeated until all completion trees have a delay below an acceptable level to achieve a desired throughput.

[0075] Additional buffers may be added in each datapath. For example,buffers 414 and 424 may be optionally added on wires (ra, x) and (ra, y)to decouple the control parts 114 and 124, respectively.

[0076] Since decoupling buffers may increase the latency of an asynchronous pipeline, a proper balance between the latency requirement and the throughput requirement should be maintained when introducing such buffers.

[0077] A stage in an asynchronous circuit usually performs both sending and receiving. One simple example is a one-place buffer having a register, an input port L, and an output port R. This buffer repeatedly receives data on the port L, and sends the data on the port R. The register that holds the data is repeatedly written and read.

[0078] It is observed that the completion mechanism for the control 114 in the sending stage 110 and the completion mechanism for the control 124 in the receiving stage 120 are similar in circuit construction and function. Since data is almost never read and written simultaneously, such similarity can be advantageously exploited to share a portion of the pipelined completion mechanism between sending data and receiving data within a stage. This simplifies the circuit and reduces the circuit size.

[0079] In particular, distributing the control signals from the control part in each stage to data cells and merging the signals from all data cells to the control part can be implemented by sharing many circuit elements. In FIG. 4, a portion of the circuit, a “copy tree,” is used in both stages. This copy tree is shown in FIG. 5. The copy tree includes two pipelined circuits: a pipelined completion tree circuit for sending a completion signal based on completion signals from data cells to the global control part in each stage and a pipelined distribution circuit for sending control signals from the global control part to data cells.

[0080] FIG. 6 shows one embodiment of a copy tree for a stage that has k data cells. This copy tree is used for both distributing k control signals from the control part (e.g., 114 in FIG. 4) to all data cells and merging k signals from all data cells to the control part. The signals r_(i), s_(i) are signals going to data cells (1≦i≦k) as requests to receive or send. The completion signal ct_(i) comes from data cell i as a request/acknowledgment signal. One advantage of this copy tree is that only one completion tree is needed to perform the functions of the two completion trees 410 and 420 in FIG. 4.

[0081] The copy tree shown in FIG. 6 is only an example. Other configurations are possible. In general, a program specification of a copy tree for both sending and receiving is as follows:

*[C?c; <∥i: 1 . . . k: D_(i)!c>]

[0082] where C is the channel shared with the control, D₁ . . . D_(k) are the channels to each data cell, and c is the value encoding the request (receive, send, etc.). The different alternatives for the buffer correspond to the different implementations of the semicolon.
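The program specification above can be approximated in software as a process that receives a request c on the control channel and then sends c on all k data-cell channels in parallel. The Python sketch below is an analogy rather than a circuit model: channels are represented by queue.Queue objects, which are buffered rather than synchronous, the infinite repetition is replaced by a fixed number of rounds, and the name copy_tree is hypothetical.

```python
import queue
import threading


def copy_tree(control, cells, rounds):
    """For each round: receive c on the control channel, then send c to every cell."""
    for _ in range(rounds):
        c = control.get()  # C?c
        # The k sends D_i!c are issued concurrently, one thread per data cell.
        senders = [threading.Thread(target=cell.put, args=(c,)) for cell in cells]
        for t in senders:
            t.start()
        for t in senders:
            t.join()  # the semicolon: all sends finish before the next C?c
```

A usage example: with two requests “r” and “s” queued on the control channel and four cell channels, copy_tree(control, cells, rounds=2) delivers “r” and then “s” to each of the four cells.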

[0083] In the above circuits, each data cell i contains a control part that communicates with a respective copy tree through the channel D_(i). In certain applications, the copy tree and the control for each data cell may be eliminated.

[0084] Consider a data cell i that receives data from a channel L^(i) and sends out data to a channel R^(i). Assuming that the requests from the copy tree to the data cells are just receive (“r”) or send (“s”), a program specification of data cell i is:

*[[D^(i)=“r”→D^(i);L^(i)?xi

□D^(i)=“s”→D^(i);R^(i)!xi

]]

[0085] The program generalizes obviously to any number of requests. Again, we have the choice among all possible implementations of the semicolon (the buffer between channel D^(i) and channel L^(i) or R^(i)). If the sequence of requests is entirely deterministic, like in the case of a buffer: r,s,r,s, . . . , there is no need for each data cell to communicate with a central control process through the copy tree. The fixed sequence of requests can be directly encoded in the control of each data cell, thereby eliminating the central control and the copy tree. Hence, the control is entirely distributed among the data cells. A central control process is usually kept when the sequence of send and receive actions in the data cells is data dependent.
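The difference between a centrally controlled data cell and one with a fixed, locally encoded request sequence can be sketched as follows. This Python example is a software analogy under stated assumptions: the request stream, the input list, and the function names are hypothetical, and channel communication is reduced to list operations.

```python
from itertools import cycle, islice


def data_cell(requests, inputs):
    """Serve 'r' (receive into x) and 's' (send x) requests in the order given."""
    incoming = iter(inputs)
    x = None
    sent = []
    for request in requests:
        if request == "r":
            x = next(incoming)   # L?x
        elif request == "s":
            sent.append(x)       # R!x
        else:
            raise ValueError("unknown request: " + repr(request))
    return sent


def buffer_cell(inputs):
    """One-place buffer: the r,s,r,s,... sequence is built in, with no central control."""
    fixed_requests = islice(cycle("rs"), 2 * len(inputs))
    return data_cell(fixed_requests, inputs)
```

In buffer_cell the request sequence never leaves the cell, which corresponds to eliminating the central control and the copy tree for the deterministic case.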

[0086] One technique used in FIG. 4 is to decompose the N-bit datapath into m small datapaths of n bits. Since each small datapath handles only a small number of bits of the N bits, the data processing logic and the control can be integrated together to form a single processing block without having a separate control part and a register. The registers in each stage shown in FIG. 4 can be eliminated. Therefore, the global control part in each stage is distributed into the multiple processing blocks in the small datapaths. Without the register, the data in each processing block can be stored in a buffer circuit incorporated in the processing block. Such implementation can usually be accomplished based on reshuffling of half buffer, precharged half buffer, and precharged full buffer disclosed in U.S. application Ser. No. 09/118,140, filed on Jul. 16, 1998, which is incorporated herein by reference. Reshuffling can be used to advantageously reduce the forward latency. FIG. 7A shows one embodiment of an asynchronous circuit by implementing multiple processing blocks.

[0087] In addition, the datapaths between different stages in an N-bit asynchronous pipeline may have different datapath structures to reduce the overall delay. The difference in the datapaths depends on the nature and complexity of these different stages. One part of the N-bit pipeline, for example, may have a single N-bit datapath while another part may have m n-bit datapaths. FIG. 7B shows three different datapath structures implemented in four pipelined stages.

[0088] FIG. 7C shows another example of decomposing a long cycle into small cycles based on the circuit in FIG. 7A. The pipelined stage A can be decomposed into two pipelined stages A1 and A2. Each processing block of the stages A1 and A2 is simplified compared to the processing block in the original stage A. Each stage, A1 or A2, performs a portion of the processing task of the original stage A. When A1 and A2 are properly constructed, the average throughput of the stages A1 and A2 is higher than that of the original stage A.

[0089] Decomposition of an N-bit datapath into multiple small datapaths shown in FIG. 7A allows each small datapath to process and transmit a portion of the data. For example, the first small datapath handles bits 1 through 8, the second small datapath handles bits 9 through 16, etc. As long as each small datapath can proceed entirely based on its own portion of the data and independently of other data portions, synchronization of different small datapaths and a global completion mechanism are not needed. This rarely occurs in practical asynchronous processors except for some local processing or pure buffering of data. In a pipeline where the data is actually transformed, the pipelined stages are often part of a logic unit (e.g., a fetch unit or a decode unit). Each processing block in stage k+1 usually needs to read some information from two or more different processing blocks in the stage k. Hence, the decomposed small datapaths need to be synchronized relative to one another.
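
As a minimal software sketch of this decomposition, the following code splits an N-bit word into m slices of n bits and applies an operation that each slice can complete from its own bits alone. The function names and the choice of bitwise AND as the independent operation are assumptions made for illustration, and the widths N=32 and n=8 are example values.

```python
def split_word(value, total_bits, slice_bits):
    """Split an N-bit value into m = N / n slices of n bits, least significant first."""
    if total_bits % slice_bits != 0:
        raise ValueError("total_bits must be a multiple of slice_bits")
    mask = (1 << slice_bits) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, slice_bits)]


def join_slices(slices, slice_bits):
    """Reassemble the full word from its slices (least significant first)."""
    value = 0
    for index, part in enumerate(slices):
        value |= part << (index * slice_bits)
    return value


def slicewise_and(a, b, total_bits=32, slice_bits=8):
    """Bitwise AND computed slice by slice; no slice needs another slice's bits."""
    a_slices = split_word(a, total_bits, slice_bits)
    b_slices = split_word(b, total_bits, slice_bits)
    return join_slices([x & y for x, y in zip(a_slices, b_slices)], slice_bits)
```

An operation of this kind needs no synchronization between slices. A comparison or an addition with carry does not have this property, which is the case the text turns to next.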

[0090] One way to implement such synchronization is illustrated in FIG. 8. A control circuit is introduced between the stage k and stage k+1 to gather global information from each processing block of stage k and to compute appropriate control signals to control the related processing blocks in stage k+1. Decomposed datapaths are not shown in FIG. 8. For example, the stage k compares two 32-bit numbers A and B and the operation of the stage k+1 depends on the comparison result. The control circuit produces a control signal indicating the difference (A−B) based on the signals from the decomposed datapaths in the stage k. This control signal is then distributed to all decomposed blocks in the stage k+1.
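
The comparison example can be sketched in software as follows. The sketch assumes that each decomposed block reports a three-valued local result (less, equal, greater) for its own bits of unsigned operands and that the control circuit reduces these to the sign of A−B; this encoding and the function names are illustrative assumptions, not the circuit of FIG. 8.

```python
def compare_slices(a, b, total_bits=32, slice_bits=8):
    """Per-block comparison: -1, 0, or 1 for each n-bit slice, least significant first."""
    mask = (1 << slice_bits) - 1
    results = []
    for shift in range(0, total_bits, slice_bits):
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        results.append((x > y) - (x < y))
    return results


def control_signal(block_results):
    """Combine block results: the most significant unequal slice decides the sign of A - B."""
    for result in reversed(block_results):
        if result != 0:
            return result
    return 0
```

No single block can determine the result alone, because a more significant block may override it. The combined value is what would then be distributed to every block in stage k+1.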

[0091] One function of the control circuit is to synchronize the operations of the two stages k and k+1. Similar to the connections between the control part 114 and the data cells in the stage 110 of FIG. 4, a copy tree can be used to connect the control circuit to each of the stages k and k+1. To maintain a high throughput and reduce the latency, the copy trees are preferably implemented as pipelined completion circuits. For example, each processing block in the stage k is connected to a block completion tree for that block. The block completion tree is then pipelined to a global completion tree via a decoupling buffer. The output of the global completion tree is then connected to the control circuit. This forms the pipelined completion tree in the copy tree that connects the stage k to the control circuit.
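
A rough way to see the benefit of pipelining the completion tree is to count tree levels. The sketch below assumes balanced binary trees, one unit of delay per level, and that a decoupling buffer lets the block tree and the global tree operate concurrently, so the longest segment rather than the sum bounds the cycle. These are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def tree_depth(leaves):
    """Number of levels in a balanced binary tree with the given number of leaves."""
    if leaves < 1:
        raise ValueError("a tree needs at least one leaf")
    # (leaves - 1).bit_length() equals ceil(log2(leaves)) for leaves >= 1.
    return (leaves - 1).bit_length()


def completion_depths(total_bits, slice_bits):
    """Compare one N-leaf completion tree with a block tree plus a global tree."""
    blocks = total_bits // slice_bits
    block_tree = tree_depth(slice_bits)   # completion within one block
    global_tree = tree_depth(blocks)      # completion across the blocks
    return {
        "single_tree": tree_depth(total_bits),
        "block_tree": block_tree,
        "global_tree": global_tree,
        # Assumption: the decoupling buffer separates the two trees into stages.
        "longest_pipelined_segment": max(block_tree, global_tree),
    }
```

Under these assumptions, a 32-bit datapath split into 8-bit blocks replaces a 5-level tree in every cycle with segments of at most 3 levels.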

[0092] When the control circuit distributes a multi-valued control signal to stage k+1, the single control wire of a basic completion tree needs to be replaced with a set of wires encoding the different values of the control signal. The copy tree shown in FIG. 6 can be extended to handle the case of a two-valued signal encoded by wires r and s.

[0093] The control circuit in FIG. 8 can introduce an extra delay between the stages k and k+1, in particular since the pipelined completion tree used usually has a plurality of decoupling buffers. This delay can form a bottleneck to the speed of the pipeline. Therefore, it may be necessary in certain applications to add buffers in a datapath between the stages k and k+1 in order to substantially equalize the length of different channels between the two stages. This technique is called “slack matching”.
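
Slack matching can be illustrated with a small calculation. The sketch assumes that the length of each channel between the two stages is measured as a count of buffers and that equalizing means padding every shorter channel up to the longest one; the function name and the channel names in the usage example are hypothetical.

```python
def slack_to_add(channel_lengths):
    """Return how many buffers to add to each channel to match the longest one."""
    longest = max(channel_lengths.values())
    return {name: longest - length for name, length in channel_lengths.items()}


# Example: the control path passes through three decoupling buffers,
# while the direct datapath has only one.
padding = slack_to_add({"control": 3, "data": 1})
```

Here the datapath would receive two additional buffers and the control path none, so that both channels have the same length.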

[0094] The above pipelined completion circuits are balanced binary trees in which the distance from the root to each leaf is constant. FIG. 9A shows a balanced binary tree. In general, a tree used in the present invention need not be balanced or binary. For example, a binary tree can be skewed as shown in FIG. 9B. FIG. 9C shows a 4-leaf skewed completion tree and FIG. 9D shows a balanced 4-leaf completion tree.
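
The difference between balanced and skewed trees can be made concrete by computing the distance from the root to each leaf. In the sketch below a tree is written as nested tuples; the two 4-leaf shapes are assumed for illustration and are not taken from FIGS. 9C and 9D.

```python
def leaf_depths(tree, depth=0):
    """Map each leaf name to its distance from the root of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {tree: depth}
    depths = {}
    for child in tree:
        depths.update(leaf_depths(child, depth + 1))
    return depths


# Assumed example shapes for a 4-leaf completion tree.
balanced = (("a", "b"), ("c", "d"))
skewed = ((("a", "b"), "c"), "d")
```

In the balanced shape every leaf lies two levels from the root, while in the skewed shape the depths range from one to three, so a skewed tree can favor a leaf whose signal is known to arrive late.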

[0095] The above embodiments provide a high throughput and a low latency by decomposing any pipeline unit into an array of simple pipeline blocks. Each block operates only on a small portion of the datapath. The global completion delay is essentially eliminated. Global synchronization between stages is implemented by copy trees and slack matching.

[0096] Although only a few embodiments are disclosed, other variations are possible. For example, the control circuit in FIG. 8 may be connected between any two stages other than two adjacent stages as shown. Also, the number of decoupling buffers between two stages can be varied. These and other variations and modifications are intended to be encompassed by the following claims.

What is claimed is:
 1. A pipelined completion element for a processor, comprising: a completion process producing a request and receiving an acknowledgment that said request has been completed from each of a plurality of different processes, including at least a first process, completing said request; and a pipelining element, including at least one buffer in said pipeline that senses a command for an action to occur in said completion process and returns an indication that said buffer has received said command. 